An introduction to \(p\)-value functions

Dr. Denis Infanger

01 October 2025

My promises

After this presentation, you will …

  • know what \(p\)-value functions are
  • know how \(p\)-value functions are constructed
  • be able to interpret \(p\)-value functions correctly
  • be able to use \(p\)-value functions to improve statistical and quantitative reasoning
  • be able to use \(p\)-value functions to compare study results

Research has a problem

Many alternatives to classical frequentist statistics, and to null-hypothesis significance testing in particular, have been proposed

  • Bayesian statistics
  • Lowering significance threshold from \(0.05\) to \(0.005\)
  • Focus on effect sizes and uncertainty intervals
  • My opinion: \(P\)-values are here to stay
    • Improve interpretation of tools that are used
    • \(P\)-value functions: Graphical tool to achieve that

\(P\)-values

Definition

\(P\) is the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed. “Extreme” means farther from the null value.

Image adapted from here.

Informally, \(p\)-values are a continuous measure of compatibility between the data and a hypothesis, given a set of background information (for details, see here).

Just one \(p\)-value?

  • “The null hypothesis states that there is no effect/no difference.” No!
  • \(P\)-values in papers often have implicit hypotheses: \(\operatorname{H}_{0}:\theta = \color{red}{0}\) vs. \(\operatorname{H}_{1}:\theta\neq \color{red}{0}\).
  • In other words: The \(p\)-value is for a test against the null value of \(0\).
  • But: You can calculate \(p\)-values for other null values too!
  • Example: \(\operatorname{H}_{0}:\theta = \color{red}{2.5}\) vs. \(\operatorname{H}_{1}:\theta\neq \color{red}{2.5}\).
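As a sketch (with hypothetical numbers), a two-sided \(p\)-value for any null value can be computed from an estimate and its standard error, assuming an approximately normal estimator:

```r
# Two-sided p-value for an arbitrary null value, assuming an
# approximately normal estimator. Estimate and SE are hypothetical.
p_for_null <- function(null, est, se) {
  z <- (est - null) / se
  2 * pnorm(-abs(z))
}

est <- 1.8  # hypothetical point estimate
se  <- 0.9  # hypothetical standard error

p_for_null(0,   est, se)  # test against the usual null value 0
p_for_null(2.5, est, se)  # test against the null value 2.5
```

The same estimate thus yields a whole family of \(p\)-values, one per null value; plotting them against the null values gives the \(p\)-value function.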

Confidence intervals and their relation to \(p\)-values

Code
library(ggplot2)

df <- data.frame(est = 0.5, lwr = 0.5 - 0.75, upr = 0.5 + 0.75)
theme_set(theme_bw())
ggplot(df, aes(x = est, y = 0)) + 
  geom_vline(xintercept = 0, linetype = 2, linewidth = 0.75) +
  geom_point(size = 4) +
  geom_errorbarh(aes(xmin = lwr, xmax = upr), height = 0, linewidth = 1) +
  labs(
    x = "Units of measurement"
    , y = ""
  ) +
  scale_x_continuous(limits = c(-2, 2), breaks = scales::pretty_breaks(n = 10)) +
  theme(
    axis.title.y=element_text(colour = "black", size = 19, hjust = 0.5, margin=margin(0,12,0,0)),
    axis.title.x=element_text(colour = "black", size=19, margin=margin(10,0,0,0)),
    # axis.title.y=element_text(size=15,hjust=0.5, vjust=1),
    axis.text.x=element_text(colour = "black", size=15),
    axis.text.y=element_blank(),
    legend.position="top",
    legend.text=element_text(size=20),
    legend.key.size = unit(3, "line"),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    legend.key=element_blank(),
    plot.title = element_text(face = "bold"),
    legend.title=element_text(size=20, face = "bold"),
    strip.text.x=element_text(size=20)
    , axis.ticks.x=element_blank()
    , axis.ticks.y=element_blank()
  )

95% Confidence interval for an arbitrary parameter
  • The confidence interval (CI) displays the values that are compatible with the data and the model.
  • All values within the CI have \(p>0.05\) (are not “significant”).
  • Values outside the CI have \(p\leq 0.05\) (are “significant”).
  • Q: What’s the \(p\)-value for the values at the edge of the CI? \(p=0.05\) (for a 95%-CI)
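This can be verified numerically for the interval shown above (\(0.5 \pm 0.75\)): treating the interval limit itself as the null value returns \(p = 0.05\). A normal-approximation sketch:

```r
# The interval above is 0.5 +/- 0.75, so se = 0.75 / qnorm(0.975).
est <- 0.5
se  <- 0.75 / qnorm(0.975)
upr <- est + qnorm(0.975) * se  # upper 95% limit (1.25)

z <- (est - upr) / se
2 * pnorm(-abs(z))  # p-value when the limit is the null value: 0.05
```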

Question

In your opinion, do the results justify the conclusion that there was no association between sitting time and diabetes? Why or why not?

Confidence intervals

  • Confidence intervals: Range of supported/compatible values given the data.
  • But:
    • Are all values inside the CI equally well supported?
    • Are all values outside the CI equally badly supported?
    • Why does the support of values decrease sharply at the edges of the CI?

95% confidence interval for Stamatakis et al. (2017)

Some ideas for improvement

  • Plot multiple confidence intervals together:
    • 99%
    • 95%
    • 90%
  • Calculate and plot \(p\)-values for null values other than \(1\) (the usual null value for ratio measures):
    • For example: \(P\)-value for \(\operatorname{H}_{0}:\) HR = \(1.5\) is \(0.0975\).
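The \(0.0975\) above can be reproduced from the reported estimate (HR \(= 1.19\), SE of the log HR \(= 0.1397\)) with a normal-approximation sketch:

```r
# P-value for H0: HR = 1.5, Stamatakis et al. (2017)
est <- log(1.19)  # log hazard ratio
se  <- 0.1397     # standard error on the log scale

z <- (est - log(1.5)) / se
2 * pnorm(-abs(z))  # ~0.0975
```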

Plot multiple confidence intervals for the same estimate

Code
# Stamatakis et al. (2017)

library(ggplot2)

est <- log(1.19)
se <- 0.1397

alphas <- c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

z_vals <- qnorm(1 - alphas/2)

lwr <- est - z_vals*se
upr <- est + z_vals*se

plot_dat <- data.frame(
  pointest = exp(est)
  , lwr = exp(lwr)
  , upr = exp(upr)
  , alphasrev = 1 - alphas
  , alphas = alphas
)

text_frame <- data.frame(
  y = c(plot_dat$alphasrev, 0)
  , x = c(plot_dat$lwr, plot_dat$pointest[1])
)

text_frame$label <- paste0(text_frame$y*100, "%")

scaleFUN <- function(x) sprintf("%.2f", x)

theme_set(theme_bw())
p <- ggplot(plot_dat, aes(y = alphasrev, x = pointest)) +
  geom_blank() +
  geom_vline(xintercept = 1, linetype = 2, linewidth = 0.5) +
  geom_point(aes(x = pointest, y = 0), size = 3) +
  geom_point(aes(x = lwr), size = 3) +
  geom_point(aes(x = upr), size = 3) +
  geom_segment(aes(x = lwr, y = alphasrev, xend = upr, yend = alphasrev), linewidth = 0.75) + 
  # geom_vline(xintercept = 1, linetype = 2, size = 0.6) +
  xlab("Hazard ratio") +
  ylab(expression("Confidence level "*(1 - alpha))) +
  # ylab("") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10), trans = "log", limits = c(0.8, 1.72)) +
  scale_y_continuous(breaks = seq(0, 1, 0.05), labels=scaleFUN, trans = "reverse") +
  # scale_colour_manual(values = c("#C16700", "#008FD0"),  name = "") +
  theme(
    axis.title.y=element_text(colour = "black", size = 19, hjust = 0.5, margin=margin(0,12,0,0)),
    axis.title.x=element_text(colour = "black", size=20),
    # axis.title.y=element_text(size=15,hjust=0.5, vjust=1),
    axis.text.x=element_text(colour = "black", size=15),
    axis.text.y=element_text(colour = "black", size=15),
    # plot.margin=unit(c(2,2,2,2,2),"line"),
    legend.position="top",
    legend.text=element_text(size=20),
    legend.key.size = unit(3, "line"),
    panel.grid.minor = element_blank(),
    # panel.grid.major = element_line(colour=grey(0.8), size=0.5),
    legend.key=element_blank(),
    plot.title = element_text(face = "bold"),
    legend.title=element_text(size=20, face = "bold"),
    # legend.key.width=unit(.01,"npc"),
    # legend.key.height=unit(.025,"npc"),
    # strip.background=element_rect(fill="white")
    strip.text.x=element_text(size=20)
    , axis.ticks.x=element_blank()
    # , axis.ticks.y=element_blank()
  ) +
  geom_label(
    data = text_frame
    , mapping = aes(x = x, y = y, label = label)
    , fill = "white"
    , inherit.aes = FALSE
    , label.size = 0.5
    , parse = FALSE
    , size = 5.5
    , hjust = 1.3
  )

p

Plot all confidence intervals, calculate all \(p\)-values

Code
# Now the "true" p-value function

library(pvaluefunctions)

stama <- conf_dist(
  estimate = c(
    0.1730998 # Stamatakis total sitting time
    # , log(1.10) # Petersen total sitting time (both sexes)
  )
  , stderr = c(
    # 0.1336387
    0.1397
  )
  , type = "coxreg"
  , plot_type = "p_val"
  , n_values = 1e4L
  , conf_level = c(0.95)
  , null_values = log(c(1))
  , trans = "exp"
  , alternative = "two_sided"
  , log_yaxis = FALSE
  , cut_logyaxis = 0.05
  , xlab = "Hazard ratio"
  , xlim = log(c(0.7, 1.9))
  , together = FALSE
  , plot_p_limit = 1-0.999
  , plot_counternull = FALSE
)

A \(p\)-value function

  • All possible confidence intervals (all confidence levels)
  • All possible \(p\)-values (two- and one-sided)

The width of the \(p\)-value function indicates the precision of the estimate

We can read off any \(p\)-value from the \(p\)-value function

Null misinterpretation

A better summary

“Our results are most compatible at the 95% level with an effect of high versus low amounts of sitting anywhere from an 8% hazard reduction to 55% increase in the hazard of diabetes.”

Implication: Ibuprofen is safe in terms of renal adverse effects (AE).

Walsh et al. (2018): Safety of Ibuprofen

Code
walsh <- conf_dist(
  estimate = c(
    log(1.84)
  )
  , stderr = c(
    0.526
  )
  , type = "general_z"
  , plot_type = "p_val"
  , n_values = 1e4L
  , conf_level = c(0.95)
  , null_values = log(c(1))
  , trans = "exp"
  , alternative = "two_sided"
  , log_yaxis = FALSE
  , cut_logyaxis = 0.05
  , xlab = "Incidence rate ratio"
  , xlim = log(c(0.5, 6))
  , together = FALSE
  , plot_p_limit = 1-0.9999
  , plot_counternull = FALSE
)

How can we find the \(p\)-value of the hypothesis
\(\operatorname{H}_0:\) IRR = \(1\) from the plot? \(p \approx 0.25\)
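The value read off the plot can be checked analytically (normal approximation on the log scale, using the reported estimate and standard error):

```r
# P-value for H0: IRR = 1, Walsh et al. (2018)
est <- log(1.84)  # log incidence rate ratio
se  <- 0.526      # standard error on the log scale

2 * pnorm(-abs(est / se))  # ~0.25
```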

Q: How would you summarize the results?

Walsh et al. (2018): Better conclusion

  • The region of high (95%) compatibility for the incidence rate ratio: 0.66 to 5.19.
  • Anything from a 34% rate reduction up to a fivefold increase!

“Our study lacked sufficient information to reach useful inference about adverse renal events comparing ibuprofen to acetaminophen alone.”

Does high power plus nonsignificance support the null?

Table 1: Hypothetical randomized trial data

                 New treatment   Placebo
Adverse events              48        32
Total                     1000      1000
  • High power for \(\operatorname{RR} = 2\) at \(\alpha = 0.05\) (over 85%)
  • \(\widehat{\operatorname{RR}} = 1.50\) (95%-CI: \(0.97\) to \(2.33\))
  • \(P = 0.07\) for \(\operatorname{H}_{0}: \operatorname{RR} = 1\)
  • Does this mean that the study supports that new treatment is equal to placebo in terms of AEs? No!
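These numbers can be reproduced from the table with the usual normal approximation for the log risk ratio:

```r
# Risk ratio, 95% CI and p-value from the 2x2 counts above
a <- 48; n1 <- 1000  # adverse events / total, new treatment
b <- 32; n2 <- 1000  # adverse events / total, placebo

log_rr <- log((a / n1) / (b / n2))
se     <- sqrt(1/a - 1/n1 + 1/b - 1/n2)  # SE of the log risk ratio

exp(log_rr)                                 # 1.50
exp(log_rr + c(-1, 1) * qnorm(0.975) * se)  # 0.97 to 2.33
2 * pnorm(-abs(log_rr / se))                # p ~ 0.07 for H0: RR = 1
```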

Does high power plus nonsignificance support the null?

Code
trial <- conf_dist( # hypothetical trial, not Walsh et al.
  estimate = c(
    log(48/32) # log risk ratio = log(1.5)
  )
  , stderr = c(
    0.2237931
  )
  , type = "general_z"
  , plot_type = "p_val"
  , n_values = 1e4L
  , conf_level = c(0.95)
  , null_values = log(c(1))
  , trans = "exp"
  , alternative = "two_sided"
  , log_yaxis = FALSE
  , cut_logyaxis = 0.05
  , xlab = "Risk ratio"
  # , xlim = log(c(0.5, 6))
  , together = FALSE
  , plot_p_limit = 1-0.9999
  , plot_counternull = FALSE
)

Based on the results, do the data favor \(\operatorname{RR} = 1\) or \(\operatorname{RR} = 2\)?

  • The point estimate is \(1.5\): Closer to \(2\) than to \(1\) in proportional terms.
  • \(P\)-value for \(\operatorname{RR} = 2\) is \(0.20\), 3 times the \(p\)-value for \(\operatorname{RR} = 1\).
  • \(1\) is closer to lower CI limit than \(2\) is to the upper limit (in proportional terms).
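A sketch comparing the two null values directly (normal approximation on the log scale; the estimate and standard error are those of the hypothetical trial above):

```r
log_rr <- log(1.5)   # estimated log risk ratio
se     <- 0.2237931  # standard error of the log risk ratio

p1 <- 2 * pnorm(-abs((log_rr - log(1)) / se))  # ~0.07 for RR = 1
p2 <- 2 * pnorm(-abs((log_rr - log(2)) / se))  # ~0.20 for RR = 2
p2 / p1  # RR = 2 is roughly 3 times more compatible than RR = 1
```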

Summary: Despite “nonsignificance” and power approximating 90% for \(\operatorname{RR} = 2\), the results favor \(\operatorname{RR} = 2\) over \(\operatorname{RR} = 1\)! (Greenland, 2012)

Comparing studies using \(p\)-value functions

Implication: No benefit for women with high alcohol consumption.

Code
stampfer <- conf_dist(
  estimate = c(
    log(0.81)
    , log(0.82)
  )
  , stderr = c(
    0.0704862
    , 0.163609
  )
  , type = "general_z"
  , plot_type = "p_val"
  , est_names = c("Moderate intake", "High intake")
  , n_values = 1e4L
  , conf_level = c(0.95)
  , null_values = log(c(1))
  , trans = "exp"
  , alternative = "two_sided"
  , log_yaxis = FALSE
  , cut_logyaxis = 0.05
  , xlab = "Relative risk for cognitive impairment"
  , xlim = log(c(0.4, 1.5))
  , together = TRUE
  , plot_p_limit = 1-0.9999
  , plot_counternull = FALSE
)

Stampfer et al. (2005): A better summary

  • No basis to infer that the effect size differs for moderate and heavy drinkers.
  • The hypothesis that is most compatible with the data is almost the same in both groups.
  • The lack of significance in heavy drinkers was driven by lower precision (larger standard error).

NSAIDs and atrial fibrillation risk

Code
schmidt <- conf_dist(
  estimate = c(
    log(1.20)
    , log(1.20)
  )
  , stderr = c(
    0.107002
    , 0.0524792
  )
  , type = "general_z"
  , plot_type = "p_val"
  , est_names = c("Chao et al. (2013)", "Schmidt et al. (2011)")
  , n_values = 1e4L
  # , conf_level = c(0.95)
  , null_values = log(c(1))
  , trans = "exp"
  , alternative = "two_sided"
  , log_yaxis = FALSE
  , cut_logyaxis = 0.05
  , xlab = "Risk ratio"
  , xlim = log(c(0.9, 1.55))
  , together = TRUE
  , plot_p_limit = 1-0.9999
  , plot_counternull = FALSE
)

Code
manal <- metafor::rma(
  yi = c(log(1.20), log(1.20))
  , vi = c(0.107002 , 0.0524792)^2
  , method = "REML"
  # , test = "knha"
)

schmidt_meta <- conf_dist(
  estimate = c(
    log(1.20)
    , log(1.20)
    , manal$b
  )
  , stderr = c(
    0.107002
    , 0.0524792
    , manal$se
  )
  , type = "general_z"
  , plot_type = "p_val"
  , est_names = c("Chao et al. (2013)", "Schmidt et al. (2011)", "Meta-Analysis")
  , n_values = 1e4L
  # , conf_level = c(0.95)
  , null_values = log(c(1))
  , trans = "exp"
  , alternative = "two_sided"
  , log_yaxis = FALSE
  , cut_logyaxis = 0.05
  , xlab = "Risk ratio"
  , xlim = log(c(0.9, 1.55))
  , together = TRUE
  , plot_p_limit = 1-0.9999
  , plot_counternull = FALSE
)

Summary \(p\)-value functions

  • Display all possible confidence intervals/\(p\)-values
  • Tip: Point estimate
  • Width: Determined by standard error (precision)
  • The standard error can be extracted from a confidence interval or from a \(p\)-value if it is not reported directly.
  • Aliases: Confidence curve, compatibility curve, consonance curve, consonance function.
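A sketch of that reconstruction for ratio measures (normal approximation on the log scale; the helper names are mine, and the example numbers are those of the hypothetical trial above):

```r
# Recover the SE of a log ratio from a reported CI or p-value
se_from_ci <- function(lwr, upr, level = 0.95) {
  (log(upr) - log(lwr)) / (2 * qnorm(1 - (1 - level) / 2))
}
se_from_p <- function(est, p) {  # est: ratio estimate, p: two-sided p vs. 1
  abs(log(est)) / qnorm(1 - p / 2)
}

se_from_ci(0.97, 2.33)  # ~0.224
se_from_p(1.5, 0.07)    # ~0.224
```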

Literature

Software: R package pvaluefunctions

https://cran.r-project.org/package=pvaluefunctions

https://github.com/DInfanger/pvaluefunctions

Software: R package parameters

Allows the creation of \(p\)-value functions directly from models using the function p_function. Here is an example from a linear regression model:

Code
library(parameters)

mod <- lm(mpg ~ wt + as.factor(gear) + am, data = mtcars)
p_curve <- p_function(mod, ci_levels = c(emph = 0.95))

plot(p_curve, n_columns = 2)

Take-Home-Messages

  • The practice of dichotomizing results into “significant” and “not significant” is not informative (Gelman & Stern, 2006).
  • Focus on effect sizes and confidence/compatibility intervals.
  • Create \(p\)-value functions to summarize the available evidence:
    • Values with high and low compatibility with the data.
    • To compare evidence from different studies.
  • Never say that “we found no effect/association” or “there was no difference” when \(p>0.05\) (Greenland et al., 2016).